About Dataset

The Pistachio dataset focuses on the classification of pistachio species based on 28 features that describe various geometric, color, and texture properties of the kernels. The dataset was collected and analyzed by Ozkan I.A., Koklu M., and Saracoglu R. for their research published in Progress in Nutrition (Volume 23, Issue 2, 2021).

The study aimed to classify pistachio species using an improved K-Nearest Neighbors (K-NN) classifier. The dataset includes features such as area, perimeter, axis lengths, aspect ratio, and RGB color statistics (mean, standard deviation, skewness, and kurtosis). The target classes are Kirmizi Pistachio and Siirt Pistachio, representing two pistachio species. This research aids in distinguishing pistachio types for food quality control and agricultural research.

Link to paper

About my project

The primary goal of the project is to reduce the dataset’s high-dimensional feature space (28 features) to a lower-dimensional space for better visualization and analysis while preserving the underlying structure and separability of the two pistachio species: Kirmizi Pistachio and Siirt Pistachio. Since the dataset includes complex features related to shape, size, and color, dimension reduction will help reveal meaningful patterns, clusters, and potential separations between the pistachio types.

Codes and Plots

# Packages 
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(readxl)
library(dplyr)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(Rtsne)
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-8
library(smacof)
## Loading required package: plotrix
## Loading required package: colorspace
## Loading required package: e1071
## 
## Attaching package: 'smacof'
## 
## The following object is masked from 'package:base':
## 
##     transform
# Working directory
setwd("E:/UWarsaw/unsupervised_learning/USL_Project/Dimension_reduction")

# Upload Pistachio data
pistachio <- read_excel("Pistachio_28_Features_Dataset.xlsx")  # path set via setwd() above

# Review Dataset
head(pistachio)
## # A tibble: 6 × 29
##    Area Perimeter Major_Axis Minor_Axis Eccentricity Eqdiasq Solidity
##   <dbl>     <dbl>      <dbl>      <dbl>        <dbl>   <dbl>    <dbl>
## 1 63391     1568.       390.       237.        0.795    284.    0.866
## 2 68358     1942.       411.       235.        0.821    295.    0.876
## 3 73589     1247.       452.       221.        0.873    306.    0.917
## 4 71106     1445.       430.       216.        0.864    301.    0.959
## 5 80087     1252.       469.       221.        0.882    319.    0.966
## 6 52268     1154.       384.       198.        0.858    258.    0.856
## # ℹ 22 more variables: Convex_Area <dbl>, Extent <dbl>, Aspect_Ratio <dbl>,
## #   Roundness <dbl>, Compactness <dbl>, Shapefactor_1 <dbl>,
## #   Shapefactor_2 <dbl>, Shapefactor_3 <dbl>, Shapefactor_4 <dbl>,
## #   Mean_RR <dbl>, Mean_RG <dbl>, Mean_RB <dbl>, StdDev_RR <dbl>,
## #   StdDev_RG <dbl>, StdDev_RB <dbl>, Skew_RR <dbl>, Skew_RG <dbl>,
## #   Skew_RB <dbl>, Kurtosis_RR <dbl>, Kurtosis_RG <dbl>, Kurtosis_RB <dbl>,
## #   Class <chr>
str(pistachio)
## tibble [2,148 × 29] (S3: tbl_df/tbl/data.frame)
##  $ Area         : num [1:2148] 63391 68358 73589 71106 80087 ...
##  $ Perimeter    : num [1:2148] 1568 1942 1247 1445 1252 ...
##  $ Major_Axis   : num [1:2148] 390 411 452 430 469 ...
##  $ Minor_Axis   : num [1:2148] 237 235 221 216 221 ...
##  $ Eccentricity : num [1:2148] 0.795 0.821 0.873 0.864 0.882 ...
##  $ Eqdiasq      : num [1:2148] 284 295 306 301 319 ...
##  $ Solidity     : num [1:2148] 0.867 0.876 0.917 0.959 0.966 ...
##  $ Convex_Area  : num [1:2148] 73160 77991 80234 74153 82929 ...
##  $ Extent       : num [1:2148] 0.639 0.677 0.713 0.703 0.746 ...
##  $ Aspect_Ratio : num [1:2148] 1.65 1.75 2.05 1.99 2.12 ...
##  $ Roundness    : num [1:2148] 0.324 0.228 0.595 0.428 0.642 ...
##  $ Compactness  : num [1:2148] 0.728 0.718 0.677 0.701 0.68 ...
##  $ Shapefactor_1: num [1:2148] 0.0062 0.006 0.0061 0.006 0.0059 0.0073 0.0054 0.0062 0.0068 0.0063 ...
##  $ Shapefactor_2: num [1:2148] 0.0037 0.0034 0.003 0.003 0.0028 0.0038 0.0035 0.0035 0.0033 0.0026 ...
##  $ Shapefactor_3: num [1:2148] 0.53 0.516 0.458 0.491 0.463 ...
##  $ Shapefactor_4: num [1:2148] 0.873 0.902 0.939 0.976 0.983 ...
##  $ Mean_RR      : num [1:2148] 196 223 213 212 230 ...
##  $ Mean_RG      : num [1:2148] 180 209 203 205 218 ...
##  $ Mean_RB      : num [1:2148] 165 187 188 188 194 ...
##  $ StdDev_RR    : num [1:2148] 17.7 26.7 19 18.2 23.4 ...
##  $ StdDev_RG    : num [1:2148] 19.6 27.2 20.1 18.7 24.1 ...
##  $ StdDev_RB    : num [1:2148] 21.1 25.1 20.7 29.8 23.1 ...
##  $ Skew_RR      : num [1:2148] 0.458 -0.385 -0.601 -0.694 -0.929 ...
##  $ Skew_RG      : num [1:2148] 0.663 -0.271 -0.45 -0.628 -0.813 ...
##  $ Skew_RB      : num [1:2148] 0.759 -0.293 0.3 -0.78 -0.497 ...
##  $ Kurtosis_RR  : num [1:2148] 2.97 1.98 3.54 2.88 2.99 ...
##  $ Kurtosis_RG  : num [1:2148] 3.06 2.1 3.69 2.87 2.88 ...
##  $ Kurtosis_RB  : num [1:2148] 2.95 2.22 4.1 2.9 2.74 ...
##  $ Class        : chr [1:2148] "Kirmizi_Pistachio" "Kirmizi_Pistachio" "Kirmizi_Pistachio" "Kirmizi_Pistachio" ...
colSums(is.na(pistachio))
##          Area     Perimeter    Major_Axis    Minor_Axis  Eccentricity 
##             0             0             0             0             0 
##       Eqdiasq      Solidity   Convex_Area        Extent  Aspect_Ratio 
##             0             0             0             0             0 
##     Roundness   Compactness Shapefactor_1 Shapefactor_2 Shapefactor_3 
##             0             0             0             0             0 
## Shapefactor_4       Mean_RR       Mean_RG       Mean_RB     StdDev_RR 
##             0             0             0             0             0 
##     StdDev_RG     StdDev_RB       Skew_RR       Skew_RG       Skew_RB 
##             0             0             0             0             0 
##   Kurtosis_RR   Kurtosis_RG   Kurtosis_RB         Class 
##             0             0             0             0
dim(pistachio)
## [1] 2148   29
# Scaling 
pistachio_scaled <- scale(select(pistachio, -Class))

The general structure of the dataset has been checked and the features scaled. The next stage is PCA. Before starting, note that PCA requires numerical data because it relies on computing the covariance matrix and its eigenvalues and eigenvectors, operations that cannot be performed on non-numerical data. Categorical or text columns therefore cannot be processed directly. In addition, PCA projects the data onto new axes using distances between data points, and only numerical values define meaningful distances in this context.
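This check can be automated; below is a minimal base-R sketch (the helper `numeric_only` is hypothetical, not part of the project code) that drops non-numeric columns before scaling, which is what `select(pistachio, -Class)` does by name here:

```r
# Keep only the numeric columns of a data frame; character columns such as
# Class cannot enter the covariance computation that PCA relies on.
numeric_only <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[, is_num, drop = FALSE]
}

d <- data.frame(Area = 1:4, Solidity = c(0.87, 0.88, 0.92, 0.96),
                Class = c("Kirmizi", "Kirmizi", "Siirt", "Siirt"))
names(numeric_only(d))  # "Area" "Solidity" -- Class is dropped
```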

# Performing PCA
pca_out <- prcomp(pistachio_scaled, center = TRUE, scale. = FALSE)
summary(pca_out)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     2.9907 2.3413 2.0709 1.6774 1.61240 0.98869 0.81147
## Proportion of Variance 0.3195 0.1958 0.1532 0.1005 0.09285 0.03491 0.02352
## Cumulative Proportion  0.3195 0.5152 0.6684 0.7689 0.86172 0.89663 0.92014
##                            PC8     PC9   PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.72381 0.63638 0.6034 0.52440 0.43074 0.39738 0.26160
## Proportion of Variance 0.01871 0.01446 0.0130 0.00982 0.00663 0.00564 0.00244
## Cumulative Proportion  0.93885 0.95332 0.9663 0.97614 0.98277 0.98841 0.99085
##                           PC15   PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.22385 0.2183 0.19957 0.17952 0.16882 0.16318 0.11266
## Proportion of Variance 0.00179 0.0017 0.00142 0.00115 0.00102 0.00095 0.00045
## Cumulative Proportion  0.99264 0.9943 0.99577 0.99692 0.99793 0.99889 0.99934
##                           PC22    PC23    PC24    PC25    PC26    PC27   PC28
## Standard deviation     0.10261 0.07269 0.03442 0.02696 0.02567 0.00866 0.0069
## Proportion of Variance 0.00038 0.00019 0.00004 0.00003 0.00002 0.00000 0.0000
## Cumulative Proportion  0.99972 0.99990 0.99995 0.99997 1.00000 1.00000 1.0000
# Choosing optimal number of PCs
fviz_eig(pca_out, addlabels = TRUE, ylim = c(0, 100)) +
  labs(title = "Scree Plot: Explained Variance by Dimensions")

Initially, no exact target number of dimensions was set. The purpose was to check what percentage of the variance is explained by each principal component. From the scree plot and the cumulative proportions, six dimensions capture about 90% of the variance, which is sufficient to continue the analysis. However, six dimensions cannot be plotted. Therefore, for visualization purposes only, three dimensions capturing about 67% of the variance are used; 67% is generally acceptable for high-dimensional datasets.
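This selection can also be computed rather than read off the plot; the sketch below (illustrated on the built-in `USArrests` data rather than the pistachio set) picks the smallest number of components whose cumulative explained variance reaches a target:

```r
# Smallest number of PCs whose cumulative proportion of variance
# reaches `target`; `pca` is an object returned by prcomp().
n_pcs_for_variance <- function(pca, target = 0.90) {
  cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  which(cum_var >= target)[1]
}

p <- prcomp(USArrests, scale. = TRUE)
n_pcs_for_variance(p, 0.90)  # 3 of the 4 components reach 90% here
```

Note that for `pca_out` the cumulative proportion at PC6 is 0.897, so a strict 0.90 threshold would return 7; the "about 90%" choice of six components corresponds to a slightly looser target.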

# 2D PCA Scatter Plot (PC1 vs PC2)
fviz_pca_ind(pca_out, label = "none", addEllipses = TRUE) +
  labs(title = "2D PCA Plot: PC1 vs PC2")

# 3D PCA Scatter Plot (PC1, PC2, PC3)
# Extracting PCA scores (principal component values for each sample)
pca_scores <- as.data.frame(pca_out$x)

# Create interactive 3D scatter plot
plot_ly(
  data = pca_scores,
  x = ~PC1,
  y = ~PC2,
  z = ~PC3,
  type = "scatter3d",
  mode = "markers",
  marker = list(size = 3, color = ~PC1, colorscale = "Viridis")
) %>%
  layout(
    title = list(
      text = "3D PCA Plot: PC1, PC2, and PC3",
      x = 0.5,
      y = 0.9
    ),
    scene = list(
      xaxis = list(title = "PC1 (31.9%)"),
      yaxis = list(title = "PC2 (19.6%)"),
      zaxis = list(title = "PC3 (15.3%)")
    )
  )

The 2D PCA plot visualizes the data using the first two principal components, which explain 31.9% and 19.6% of the variance, respectively. The points appear densely clustered near the center, with no distinct separation, suggesting that the first two components capture some variance but may not fully distinguish different subgroups or patterns. The circular boundary represents the confidence ellipse, indicating the general spread of the data. The outlying points suggest possible outliers or samples with unique properties that may warrant further investigation. This plot indicates that additional dimensions or nonlinear methods may be needed for clearer separation.

The 3D PCA plot visualizes the data using the first three principal components, which together explain a larger portion of the variance (31.9%, 19.6%, and 15.3%, about 66.8% in total). The points remain densely packed with no clear separation into distinct clusters, indicating that even with an additional dimension, the data does not show strong linear separation. Some outlying points are visible, reinforcing the possibility of unique samples or outliers. The 3D plot provides more depth than the 2D plot, but it still suggests that additional analysis is needed to capture meaningful patterns.
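The outlying points mentioned above could be screened with a simple distance rule (an ad-hoc sketch, not a formal outlier test; shown on the built-in `USArrests` data, with `pca_out$x` as the drop-in for this project):

```r
# Squared distance of each observation from the origin in the space of
# the first three PCs; unusually large values flag candidate outliers.
p <- prcomp(USArrests, scale. = TRUE)
scores <- p$x[, 1:3]
d2 <- rowSums(scores^2)
head(sort(d2, decreasing = TRUE), 5)  # the five most extreme samples
```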

How much each variable contributes to each principal component can be observed in the loadings matrix and the following diagram.

# Contribution
pca_out$rotation
##                       PC1         PC2          PC3         PC4          PC5
## Area           0.28913177 -0.06297096  0.086195324 -0.18484522  0.185830858
## Perimeter     -0.03033397 -0.29517268 -0.053972083  0.05456877  0.340974456
## Major_Axis     0.15619747  0.11835062 -0.002569645 -0.20607705  0.464257224
## Minor_Axis     0.23853127 -0.27981213  0.077038306 -0.01876012  0.078736979
## Eccentricity  -0.13535753  0.34504602 -0.082054341 -0.10888076  0.177151466
## Eqdiasq        0.29145981 -0.06012621  0.079949455 -0.18785142  0.185315183
## Solidity       0.16897531  0.27600105  0.081713350 -0.19817265 -0.217980648
## Convex_Area    0.25292707 -0.16590536  0.060533667 -0.12776281  0.274804285
## Extent         0.15535932  0.16632048  0.017313212 -0.19550121 -0.201030208
## Aspect_Ratio  -0.16275575  0.33255667 -0.067968072 -0.08396291  0.188815670
## Roundness      0.14447874  0.26247690  0.085041027 -0.07930804 -0.263635381
## Compactness    0.21948813 -0.22270226  0.111583714 -0.01819257 -0.293879435
## Shapefactor_1 -0.29640289  0.12813065 -0.081499218  0.14266153  0.015718818
## Shapefactor_2 -0.18854313 -0.23618024 -0.030082974  0.29335708 -0.189902393
## Shapefactor_3  0.21594148 -0.22522924  0.115044867 -0.01578866 -0.294428293
## Shapefactor_4  0.13545144  0.27630950  0.076107808 -0.23756782 -0.215647566
## Mean_RR        0.25274902  0.07577850 -0.206047175  0.17370985  0.008871138
## Mean_RG        0.24938292  0.10185096 -0.214274820  0.19071146  0.014509468
## Mean_RB        0.20483864  0.09185584 -0.182960134  0.17431644 -0.005493298
## StdDev_RR      0.06931134 -0.01630882 -0.435740165 -0.03797845 -0.043051844
## StdDev_RG      0.06457383 -0.04655884 -0.440075499 -0.04459114 -0.027282079
## StdDev_RB      0.02339150  0.00749253 -0.367360642 -0.10080009 -0.013826638
## Skew_RR       -0.23061009 -0.14530255  0.038725751 -0.31539277 -0.029758661
## Skew_RG       -0.22858059 -0.10810178  0.063181614 -0.31310306 -0.026432261
## Skew_RB       -0.16000173 -0.06793593  0.190174258 -0.25691020  0.005987687
## Kurtosis_RR    0.12169323  0.18537106  0.233611029  0.31846158  0.094973983
## Kurtosis_RG    0.07786857  0.16304876  0.273281273  0.31392419  0.112239162
## Kurtosis_RB   -0.02965340  0.09384878  0.312769204  0.16824830  0.104548814
##                        PC6         PC7         PC8          PC9        PC10
## Area          -0.002285518  0.05425644 -0.04065087  0.007277396 -0.05590902
## Perimeter      0.034269003 -0.42841098  0.23946687  0.169476946  0.14563986
## Major_Axis     0.033377073  0.13519284 -0.03276423 -0.118954547 -0.04706267
## Minor_Axis    -0.020203548  0.10982701 -0.03329461 -0.117668810 -0.05041666
## Eccentricity   0.042386517 -0.04969499  0.01646208  0.089911204  0.02499787
## Eqdiasq       -0.002502840  0.04726496 -0.04595697  0.002228012 -0.03123970
## Solidity       0.012383003 -0.10963492  0.02268424  0.234417976  0.10819785
## Convex_Area   -0.000202289  0.08935178 -0.05277081 -0.083531298 -0.08176783
## Extent         0.050114751 -0.43769434  0.39521454 -0.708129837 -0.10414906
## Aspect_Ratio   0.045729427 -0.02452996  0.02532192  0.040006545 -0.01810074
## Roundness     -0.033692891  0.46866504 -0.19836926 -0.201291579 -0.13318133
## Compactness   -0.044436179 -0.09794363 -0.02820301  0.136403009  0.03329117
## Shapefactor_1  0.006501588  0.03744699  0.05514592 -0.048279800 -0.07514503
## Shapefactor_2 -0.031769004  0.10205635  0.04068132 -0.181605859 -0.04255084
## Shapefactor_3 -0.048309704 -0.09488118 -0.02636305  0.137504120  0.02185387
## Shapefactor_4  0.009494311 -0.25696839 -0.01820935  0.370101230  0.08765058
## Mean_RR       -0.074693012  0.06956055 -0.00548522 -0.075864766  0.27671289
## Mean_RG       -0.123358107  0.10641957  0.17661355  0.037781439  0.04103912
## Mean_RB       -0.215399859  0.09742597  0.47360006  0.299352216 -0.57163079
## StdDev_RR     -0.140542246 -0.08697782 -0.16380925 -0.073709005 -0.07430165
## StdDev_RG     -0.222173863 -0.08057963 -0.07869155  0.007588307 -0.10410540
## StdDev_RB     -0.423576208 -0.11056833 -0.29146578 -0.086974210  0.28830007
## Skew_RR       -0.223308413  0.02794644  0.03032666  0.043374074 -0.26162156
## Skew_RG       -0.340591942  0.07952590  0.09253982  0.026753478 -0.12810204
## Skew_RB       -0.455806639  0.20214030  0.33050367 -0.033891437  0.35145656
## Kurtosis_RR   -0.272168451  0.02141692  0.12230856 -0.025008248  0.19601883
## Kurtosis_RG   -0.334882127 -0.05648595  0.03485296 -0.039204387  0.08448255
## Kurtosis_RB   -0.329514097 -0.38294721 -0.47121852 -0.063921590 -0.38320380
##                       PC11         PC12         PC13          PC14         PC15
## Area           0.028081741 -0.044069881 -0.004631123 -0.1787031824  0.050516759
## Perimeter     -0.042711194  0.004976416  0.104506925  0.0979564559 -0.216786936
## Major_Axis     0.005277951 -0.009726092 -0.019505933  0.0023177266 -0.156799860
## Minor_Axis     0.027286899 -0.013742760 -0.006751634 -0.0562460910 -0.131144572
## Eccentricity  -0.039329561  0.017635018  0.020393373  0.4920988961  0.351962072
## Eqdiasq        0.011816034 -0.031601697 -0.013079443 -0.0253633424  0.012282714
## Solidity      -0.056079474 -0.024124241  0.060404253  0.0704235154 -0.571277031
## Convex_Area    0.044202571 -0.026365230 -0.032846945 -0.1654072251  0.185796234
## Extent         0.058098801 -0.002335320  0.020109133 -0.0042140104  0.043630044
## Aspect_Ratio   0.012926216 -0.005545593  0.002126392 -0.5064657342 -0.133808057
## Roundness     -0.011934666 -0.008180235  0.003813607  0.1072754597 -0.093916984
## Compactness    0.002272040 -0.026617239 -0.008303979 -0.0252968860  0.112987087
## Shapefactor_1  0.038178833 -0.019066967  0.037678397 -0.5424866902 -0.005192600
## Shapefactor_2  0.022950999  0.020042699  0.025462302 -0.0280337805 -0.109503883
## Shapefactor_3  0.011198364 -0.036579081 -0.007535280 -0.1756390217  0.081849041
## Shapefactor_4 -0.014426867 -0.032942518 -0.007832006 -0.2196219365  0.233214420
## Mean_RR       -0.314858091  0.215088459  0.528599393 -0.1230077563  0.102044423
## Mean_RG       -0.087292850  0.228231981  0.331637070  0.0338541678 -0.058486367
## Mean_RB        0.224388677  0.161569595 -0.140340035  0.0462234345 -0.041873517
## StdDev_RR     -0.454883686 -0.240700566 -0.225751393  0.0445470064 -0.198137505
## StdDev_RG     -0.226299630 -0.208103536 -0.232368342 -0.0927681474  0.279745842
## StdDev_RB      0.667201349  0.095638039 -0.010925110  0.0272488473 -0.118228328
## Skew_RR       -0.066246901 -0.251784392  0.360473452  0.0328859508 -0.148582096
## Skew_RG        0.011659638 -0.159349401  0.431482784  0.0211923152  0.160242905
## Skew_RB       -0.270077843  0.394637120 -0.379739423 -0.0512137280 -0.005552488
## Kurtosis_RR    0.112413762 -0.353197371  0.014419120 -0.0243541812  0.258182799
## Kurtosis_RG   -0.079535771 -0.451520388 -0.071658874  0.0533862844 -0.218402399
## Kurtosis_RB   -0.164309962  0.427983118  0.028291337  0.0003134657  0.005924672
##                      PC16         PC17         PC18          PC19         PC20
## Area           0.06131033  0.273439209 -0.012037207 -0.2363637307 -0.169561496
## Perimeter     -0.09657846  0.446312163 -0.141574477  0.3828809667  0.196469251
## Major_Axis    -0.08580252 -0.126682336  0.058643034 -0.0094310461  0.006589356
## Minor_Axis    -0.09146242  0.017695164  0.006731446 -0.0287869067 -0.082462628
## Eccentricity   0.18541645  0.355273354 -0.070137362 -0.1926159124 -0.194730412
## Eqdiasq        0.01592939  0.084023947  0.009623186 -0.1073971002 -0.074451378
## Solidity      -0.36439846  0.115969365  0.053213349 -0.1721794180 -0.348694720
## Convex_Area    0.11401000  0.117217033 -0.023952039 -0.0831185163 -0.012844308
## Extent         0.02331302 -0.007644697 -0.014593492  0.0018228251 -0.001255724
## Aspect_Ratio  -0.02968265 -0.137047838  0.005057367  0.1272556481  0.145846879
## Roundness     -0.04096330  0.419570624 -0.192273375  0.4495761389  0.239222534
## Compactness    0.08399346  0.060320019 -0.024902058 -0.0051023312  0.039994942
## Shapefactor_1  0.08714134  0.450288685 -0.084907908 -0.1455645506 -0.109508619
## Shapefactor_2 -0.05809043  0.255905295  0.041346855 -0.3075574427 -0.244795168
## Shapefactor_3  0.09924277  0.080247576 -0.034892215  0.0322231279  0.088544364
## Shapefactor_4  0.16167114  0.034270437 -0.014201134 -0.0374291409  0.118458156
## Mean_RR        0.19283886 -0.040607636  0.226794080  0.3013176534 -0.358557475
## Mean_RG       -0.09276631 -0.073685002 -0.435034521 -0.4515623998  0.443044167
## Mean_RB        0.11461826 -0.005287054  0.203564829  0.1480051133 -0.109027154
## StdDev_RR      0.19026112  0.148355462  0.423636306 -0.1740320998  0.319529459
## StdDev_RG     -0.42663890 -0.087742835 -0.418202792  0.1566998039 -0.303131710
## StdDev_RB      0.12482923  0.033408749  0.013607851  0.0130573708 -0.008649553
## Skew_RR        0.26496594 -0.103365753 -0.288830560 -0.0006293718 -0.068810905
## Skew_RG       -0.27431119  0.032758793  0.324030846 -0.0227007957  0.117857110
## Skew_RB        0.05415237  0.024758291 -0.035287300 -0.0049780440 -0.011428530
## Kurtosis_RR   -0.36967688  0.093794449  0.220952387 -0.0354451089  0.143963977
## Kurtosis_RG    0.38078316 -0.122941923 -0.210995605  0.0352273131 -0.127791601
## Kurtosis_RB   -0.10431399  0.024768554  0.008087053 -0.0144410489  0.009313199
##                       PC21          PC22          PC23          PC24
## Area          -0.039008931 -0.0232472162 -0.0095916631  1.937806e-01
## Perimeter     -0.108572535 -0.0398663125  0.0293134352  9.460242e-03
## Major_Axis    -0.280871399 -0.0168171706 -0.4979427833  1.362089e-01
## Minor_Axis     0.270145763  0.0362778876  0.3393995609 -3.482126e-01
## Eccentricity   0.121076539  0.0151786421 -0.1180669995 -3.861637e-01
## Eqdiasq       -0.050172250 -0.0070648622 -0.1492419076  9.283122e-02
## Solidity       0.118850065  0.0404758417  0.0633151795 -5.908399e-02
## Convex_Area   -0.023399085 -0.0464522870  0.3856468473 -2.008856e-01
## Extent         0.011353834 -0.0008152751 -0.0055095005  1.433977e-03
## Aspect_Ratio  -0.100628741 -0.0271987714 -0.0468931654 -6.043783e-01
## Roundness     -0.101119659 -0.0466275507  0.0368633945  6.321902e-04
## Compactness    0.051926617  0.0072598629 -0.2377933253 -1.175122e-01
## Shapefactor_1  0.353312377  0.0369746951 -0.1076243931  2.826513e-01
## Shapefactor_2 -0.656149138 -0.1209894989 -0.0296944359 -2.174703e-01
## Shapefactor_3  0.123168210  0.0258397188 -0.5367798803 -2.977970e-01
## Shapefactor_4 -0.413462087 -0.1039530651  0.2866298837  1.600134e-01
## Mean_RR       -0.032747094  0.0612921377 -0.0028418992  5.580355e-03
## Mean_RG        0.039971478 -0.0681710539  0.0056329007 -4.887703e-03
## Mean_RB       -0.015171575  0.0213530913  0.0041107213  1.796909e-03
## StdDev_RR      0.024916621  0.0878503768  0.0317228693 -1.225328e-02
## StdDev_RG     -0.031665226 -0.0765448082 -0.0291838849  8.636518e-03
## StdDev_RB     -0.002747448  0.0219827255  0.0065173237 -1.271332e-03
## Skew_RR       -0.122873465  0.5436520396  0.0362682599 -3.883946e-03
## Skew_RG        0.104942778 -0.4901692651 -0.0220609138 -6.253278e-05
## Skew_RB       -0.006629953  0.0668017770  0.0080465577 -6.093192e-04
## Kurtosis_RR   -0.067752848  0.4808432479  0.0152939651 -1.047456e-02
## Kurtosis_RG    0.054996377 -0.4064229133  0.0008725263  1.064212e-03
## Kurtosis_RB   -0.003567403  0.0463938625 -0.0042043914  7.877424e-04
##                        PC25          PC26          PC27          PC28
## Area           0.6126758452  1.292826e-01  0.3866950383  2.160376e-01
## Perimeter      0.0134964535  1.285104e-02  0.0015295344  2.955902e-03
## Major_Axis    -0.4042216416 -1.740608e-01  0.1304042087  2.414773e-01
## Minor_Axis    -0.4312730729  4.142270e-01  0.2538362211  2.278351e-01
## Eccentricity  -0.0791727178  4.933860e-02  0.0330684977  5.296949e-02
## Eqdiasq       -0.0140068027  4.997904e-01 -0.5779378308 -4.313204e-01
## Solidity      -0.0041091703 -2.260417e-01 -0.0657425693 -6.905971e-02
## Convex_Area   -0.0594098350 -6.468441e-01 -0.1924485635 -1.944631e-01
## Extent         0.0011893665  3.074050e-04 -0.0002610291  3.288658e-04
## Aspect_Ratio   0.2559370054  1.282081e-01 -0.1121452215  7.556047e-02
## Roundness      0.0194526016  8.934920e-03 -0.0008270591  3.977436e-03
## Compactness    0.0302842277 -1.007388e-01 -0.4755381540  6.584611e-01
## Shapefactor_1 -0.2809054000 -2.978942e-02 -0.0908543838  4.469970e-02
## Shapefactor_2 -0.0517104741  3.842855e-02 -0.0205705037 -2.030698e-02
## Shapefactor_3 -0.0813513889 -1.102152e-01  0.3541538949 -4.107976e-01
## Shapefactor_4 -0.3314572970  1.351323e-01  0.1403376375  3.460275e-02
## Mean_RR       -0.0056655743 -1.921297e-03  0.0012071046  5.900681e-04
## Mean_RG        0.0053039885 -2.593827e-03 -0.0023936043 -2.297032e-03
## Mean_RB       -0.0028978397 -1.153382e-03  0.0005076938  6.551126e-04
## StdDev_RR      0.0054829636 -1.776369e-03 -0.0025278872  6.385271e-04
## StdDev_RG     -0.0063479304  5.143497e-03  0.0056134006  8.559803e-04
## StdDev_RB      0.0024383903  4.795006e-04 -0.0009891157  2.249830e-04
## Skew_RR       -0.0024626264 -5.114633e-03 -0.0010160337 -1.228732e-03
## Skew_RG        0.0009508219  7.461220e-04  0.0010854370  1.412459e-03
## Skew_RB        0.0012786058 -9.751178e-06 -0.0005527591 -2.307304e-04
## Kurtosis_RR    0.0045009397 -1.993610e-03 -0.0005913151 -7.434045e-06
## Kurtosis_RG   -0.0020063138  1.592403e-03  0.0009252629  1.031406e-03
## Kurtosis_RB    0.0011414091 -2.951987e-04  0.0010697130  1.193391e-04
fviz_pca_var(pca_out, col.var = "contrib")

Judging from the loadings, the first principal component (Dim1) is dominated by size-related features (Area, Eqdiasq, Convex_Area, Shapefactor_1) together with the mean RGB intensities, while the second component (Dim2) mainly captures shape features (Eccentricity, Aspect_Ratio, Perimeter, Solidity). The longer arrows indicate features that contribute strongly to the variability explained by the first two components, while features whose arrows end near the origin (such as StdDev_RB) contribute very little to these components.
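The `contrib` colouring used by `fviz_pca_var()` can be reproduced by hand: each rotation column is a unit vector, so the squared loadings, scaled to percentages, give each variable's contribution to a component (a sketch on the built-in `USArrests` data; for this project substitute `pca_out$rotation`):

```r
p <- prcomp(USArrests, scale. = TRUE)
contrib <- 100 * p$rotation^2  # % contribution of each variable to each PC
round(contrib[, 1:2], 1)       # contributions to Dim1 and Dim2
colSums(contrib)               # each column sums to 100
```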

# View loadings (contributions of features to each principal component)
pca_loadings <- pca_out$rotation  # Loadings matrix 
# Bar plot for PC6 loadings
barplot(sort(pca_loadings[, 6], decreasing = TRUE), las = 2, col = "khaki",
        main = "Feature Contributions to PC6", ylab = "Contribution", cex.names = 0.7)

The bar plot represents the contributions of individual features to the sixth principal component. The height of each bar indicates how strongly a feature contributes to PC6. Negative values (bars extending downward) indicate that the feature contributes in the opposite direction to positive values (bars extending upward), though the sign does not affect importance; only the magnitude matters.

Top Contributing Features: The features Skew_RB, Skew_RG, and StdDev_RG have the largest contributions (in absolute terms) to PC6. These features describe color-based skewness and standard deviation, suggesting that PC6 is strongly associated with variations in color properties rather than geometric features.

Low-Contributing Features: Features such as Extent, Aspect_Ratio, Eccentricity, and Perimeter contribute minimally to PC6, indicating that shape-related characteristics have little influence on this component. Possible Interpretation of PC6:

Since the top contributing features are related to color variations, PC6 likely captures differences in pistachio kernel color distribution, especially in how the red and green channels deviate from normality (measured by skewness) or vary in spread (measured by standard deviation). The minimal contribution of geometric features implies that PC6 does not significantly reflect differences in the shape or size of the kernels.
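Because only the magnitude matters, ranking loadings by absolute value (rather than the signed sort used in the barplot) gives the contribution order directly; a base-R sketch, using the built-in `USArrests` data in place of `pca_loadings[, 6]`:

```r
p <- prcomp(USArrests, scale. = TRUE)
ld <- p$rotation[, 2]                            # loadings of one component
ranked <- ld[order(abs(ld), decreasing = TRUE)]  # strongest first
round(ranked, 3)
```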

Since PCA is linear, other methods such as t-SNE and MDS can be applied to see whether nonlinear methods provide clearer separation between the pistachio classes in the reduced dimensions.

# Perform t-SNE (2D)
set.seed(123)  # For reproducibility
tsne_out <- Rtsne(pistachio_scaled, dims = 2, perplexity = 30, verbose = TRUE, max_iter = 1000)
## Performing PCA
## Read the 2148 x 28 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 1.20 seconds (sparsity = 0.058468)!
## Learning embedding...
## Iteration 50: error is 78.899979 (50 iterations in 0.55 seconds)
## Iteration 100: error is 76.390976 (50 iterations in 0.48 seconds)
## Iteration 150: error is 75.743611 (50 iterations in 0.52 seconds)
## Iteration 200: error is 75.740763 (50 iterations in 0.48 seconds)
## Iteration 250: error is 75.741065 (50 iterations in 0.52 seconds)
## Iteration 300: error is 2.013454 (50 iterations in 0.49 seconds)
## Iteration 350: error is 1.793940 (50 iterations in 0.50 seconds)
## Iteration 400: error is 1.705625 (50 iterations in 0.47 seconds)
## Iteration 450: error is 1.662918 (50 iterations in 0.45 seconds)
## Iteration 500: error is 1.643943 (50 iterations in 0.44 seconds)
## Iteration 550: error is 1.632196 (50 iterations in 0.48 seconds)
## Iteration 600: error is 1.625159 (50 iterations in 0.49 seconds)
## Iteration 650: error is 1.620462 (50 iterations in 0.46 seconds)
## Iteration 700: error is 1.615599 (50 iterations in 0.51 seconds)
## Iteration 750: error is 1.611384 (50 iterations in 0.49 seconds)
## Iteration 800: error is 1.606686 (50 iterations in 0.49 seconds)
## Iteration 850: error is 1.602569 (50 iterations in 0.49 seconds)
## Iteration 900: error is 1.598953 (50 iterations in 0.48 seconds)
## Iteration 950: error is 1.596958 (50 iterations in 0.49 seconds)
## Iteration 1000: error is 1.594872 (50 iterations in 0.49 seconds)
## Fitting performed in 9.76 seconds.
# Convert t-SNE output to dataframe
tsne_df <- data.frame(tsne_out$Y) 
tsne_df$Class <- pistachio$Class  
colnames(tsne_df) <- c("Dim1", "Dim2", "Class")  

# Plot t-SNE result
ggplot(tsne_df, aes(x = Dim1, y = Dim2, color = Class)) +
  geom_point(size = 2) +
  labs(title = "t-SNE Plot", x = "t-SNE Dim 1", y = "t-SNE Dim 2") +
  theme_minimal()

The t-SNE plot shows the distribution of the two pistachio species, Kirmizi_Pistachio and Siirt_Pistachio, in the reduced 2D space. There is some overlap between the two classes, indicating that the features may not fully separate the species in this projection. However, distinct regions dominated by one class suggest that t-SNE captures local differences, meaning some features contribute to separating the species. Isolated points away from the main clusters may be outliers or samples with unique properties. Compared to PCA, t-SNE often shows better-defined clusters, indicating that nonlinear relationships exist in the data that PCA may not fully capture. If we want clearer separation, we can experiment with different t-SNE perplexity values or compare the results to MDS.

# Classical MDS
# Compute distance matrix
dist_matrix <- dist(pistachio_scaled)

# Classical MDS (cMDS) for 2D projection
cmds_out <- cmdscale(dist_matrix, k = 2)  # k = number of dimensions (2 for 2D)

# Dataframe with MDS results
mds_df <- data.frame(cmds_out)
mds_df$Class <- pistachio$Class  # Add class labels
colnames(mds_df) <- c("Dim1", "Dim2", "Class")

# Plot Classical MDS result
ggplot(mds_df, aes(x = Dim1, y = Dim2, color = Class)) +
  geom_point(size = 2) +
  labs(title = "Classical MDS Plot", x = "MDS Dim 1", y = "MDS Dim 2") +
  theme_minimal()

There are several MDS methods. One of the best known is classical MDS, a linear, distance-based technique (with Euclidean distances it is equivalent to PCA on the same data). It is also faster than the other MDS variants, which helps when working with large datasets. If overlapping is still observed, we can proceed with other MDS methods. Although there is overlap here, the projection is cleaner than the t-SNE one.

While some areas have more distinct regions dominated by one class, the overall structure indicates that the pistachio species share many similar geometric and color characteristics in this distance-based projection. Classical MDS is based on linear transformations, which may not capture the nonlinear aspects of the data.
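The linearity point can be checked directly: with Euclidean distances, classical MDS recovers the PCA scores up to a sign flip per axis (a small verification sketch on the built-in `USArrests` data):

```r
x <- scale(USArrests)
mds2 <- cmdscale(dist(x), k = 2)   # classical MDS on Euclidean distances
pc2  <- prcomp(x)$x[, 1:2]         # PCA scores of the same data
max(abs(abs(mds2) - abs(pc2)))     # effectively zero: same embedding
```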

# Metric MDS
library(smacof)  # provides smacofSym()

# Perform Metric MDS (2D)
set.seed(123)  # For reproducibility
metric_mds_out <- smacofSym(dist_matrix, ndim = 2)  # 'ndim = 2' for 2D projection

# Create a dataframe with MDS results
metric_mds_df <- data.frame(metric_mds_out$conf)  # 'conf' contains the 2D coordinates
metric_mds_df$Class <- pistachio$Class  # Add the original class labels
colnames(metric_mds_df) <- c("Dim1", "Dim2", "Class")

# Plot Metric MDS result
ggplot(metric_mds_df, aes(x = Dim1, y = Dim2, color = Class)) +
  geom_point(size = 2) +
  labs(title = "Metric MDS Plot", x = "MDS Dim 1", y = "MDS Dim 2") +
  theme_minimal()

# Non-Metric MDS
# Perform Non-Metric MDS (2D)
set.seed(123)  # For reproducibility
nonmetric_mds_out <- smacofSym(dist_matrix, ndim = 2, type = "ordinal")  # "ordinal" for non-metric MDS

# Create a dataframe with MDS results
nonmetric_mds_df <- data.frame(nonmetric_mds_out$conf)
nonmetric_mds_df$Class <- pistachio$Class  # Add the original class labels
colnames(nonmetric_mds_df) <- c("Dim1", "Dim2", "Class")

# Plot Non-Metric MDS result
ggplot(nonmetric_mds_df, aes(x = Dim1, y = Dim2, color = Class)) +
  geom_point(size = 2) +
  labs(title = "Non-Metric MDS Plot", x = "MDS Dim 1", y = "MDS Dim 2") +
  theme_minimal()

All three MDS methods indicate a high degree of similarity between the two pistachio species in terms of the selected features. The overlapping clusters suggest that the dataset may require additional features, feature transformation, or different distance metrics for better separation. Nonlinear methods like t-SNE showed slightly better separation, indicating that the relationships may be too complex for MDS to capture effectively.
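The comparison between metric and non-metric MDS can also be made numerically via their Stress-1 values, which `smacofSym()` returns in `$stress` (lower is better). A sketch with stand-in data, assuming the smacof package:

```r
library(smacof)

# Stand-in data (placeholder for pistachio_scaled)
set.seed(123)
X <- matrix(rnorm(100 * 5), ncol = 5)
d <- dist(X)

# Metric (ratio) vs. non-metric (ordinal) fits
metric_fit    <- smacofSym(d, ndim = 2, type = "ratio")
nonmetric_fit <- smacofSym(d, ndim = 2, type = "ordinal")

# Stress-1 for each fit: lower means the 2D configuration
# reproduces the input dissimilarities more faithfully
c(metric = metric_fit$stress, ordinal = nonmetric_fit$stress)
```

Because the ordinal fit only has to preserve the rank order of the distances, its stress is usually no worse than the metric fit's on the same data.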

Conclusion

The overall evaluation of the project indicates that while PCA, t-SNE, and MDS provided valuable insights, the two pistachio classes (Kirmizi_Pistachio and Siirt_Pistachio) showed significant overlap in most dimension reduction methods. PCA captured some variance, but nonlinear methods like t-SNE provided better class separation, highlighting the complexity of the feature space. MDS, both metric and non-metric, struggled to differentiate the classes, reinforcing the idea that linear distance-based projections may not be sufficient. This suggests that the dataset may benefit from additional features, alternative distance metrics, or more advanced nonlinear techniques for improved separation. The project demonstrated the strengths and limitations of various dimension reduction methods in real-world classification problems.

Literature Review

Author

  1. Gareth James
  2. Daniela Witten
  3. Trevor Hastie
  4. Robert Tibshirani

An Introduction to Statistical Learning with Applications in R

Key findings from literature

What is PCA in Dimension Reduction?

When faced with a large set of correlated variables, principal components allow us to summarize this set with a smaller number of representative variables that collectively explain most of the variability in the original set. It finds a low-dimensional representation of a data set that contains as much as possible of the variation. The idea is that each of the n observations lives in p-dimensional space, but not all of these dimensions are equally interesting. PCA seeks a small number of dimensions that are as interesting as possible, where the concept of interesting is measured by the amount that the observations vary along each dimension. For instance, if we can obtain a two-dimensional representation of the data that captures most of the information, then we can plot the observations in this low-dimensional space.
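The idea of a low-dimensional representation capturing "most of the information" can be made concrete with the proportion of variance explained. A short sketch using `prcomp` on the built-in USArrests data (the standard ISLR example) rather than the pistachio features:

```r
# PCA on standardized variables
pca_fit <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained by each component,
# and its cumulative sum
pve <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
round(cumsum(pve), 3)
```

Here the first two components explain a large majority of the variance, which is what justifies plotting the observations in a two-dimensional score space.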

Another Interpretation of PCA

In the previous section, we describe the principal component loading vectors as the directions in feature space along which the data vary the most, and the principal component scores as projections along these directions. However, an alternative interpretation for principal components can also be useful: principal components provide low-dimensional linear surfaces that are closest to the observations. We expand upon that interpretation here. The first principal component loading vector has a very special property: it is the line in p-dimensional space that is closest to the n observations (using average squared Euclidean distance as a measure of closeness). The appeal of this interpretation is clear: we seek a single dimension of the data that lies as close as possible to all of the data points, since such a line will likely provide a good summary of the data.
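The "closest line" property can be checked directly: projecting the centered data onto the first loading vector leaves a smaller residual sum of squares than projecting onto any other direction. A minimal sketch, again using USArrests:

```r
# Centered and scaled data
Xc <- scale(USArrests, center = TRUE, scale = TRUE)

# Sum of squared distances from the points to the line spanned by v
ssq_residual <- function(X, v) {
  v <- v / sqrt(sum(v^2))          # unit-length direction
  proj <- X %*% v %*% t(v)         # rank-1 projection onto the line
  sum((X - proj)^2)
}

set.seed(123)
v_pc1  <- prcomp(Xc)$rotation[, 1]  # first loading vector
v_rand <- rnorm(ncol(Xc))           # an arbitrary competing direction

ssq_residual(Xc, v_pc1) < ssq_residual(Xc, v_rand)   # TRUE
```

The inequality holds for any direction other than the first principal component itself, which is exactly the closest-line interpretation.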

Uniqueness of the Principal Components

Each principal component loading vector is unique, up to a sign flip. This means that two different software packages will yield the same principal component loading vectors, although the signs of those loading vectors may differ. The signs may differ because each principal component loading vector specifies a direction in p-dimensional space: flipping the sign has no effect as the direction does not change.
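This sign indeterminacy is harmless because flipping a loading vector flips the corresponding scores as well, leaving the fitted low-dimensional surface unchanged. A small demonstration:

```r
pca_fit <- prcomp(USArrests, scale. = TRUE)
v <- pca_fit$rotation[, 1]   # first loading vector
z <- pca_fit$x[, 1]          # first principal component scores

# Rank-1 reconstructions with original and flipped signs:
# z v^T  versus  (-z)(-v)^T
recon_plus  <- outer(z, v)
recon_minus <- outer(-z, -v)

all.equal(recon_plus, recon_minus)   # TRUE
```

So two packages that disagree on signs still agree on every quantity that matters, such as reconstructions and distances between projected points.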

How to find appropriate number of Principal Components

We typically decide on the number of principal components required to visualize the data by examining a scree plot. We choose the smallest number of principal components that are required in order to explain a sizable amount of the variation in the data. This is done by eyeballing the scree plot, and looking for a point at which the proportion of variance explained by each subsequent principal component drops off. This is often referred to as an elbow in the scree plot.
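A scree plot of this kind is straightforward to produce from a `prcomp` fit; the sketch below uses USArrests, where the elbow after the second component is easy to see:

```r
pca_fit <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained by each component
pve <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)

# Scree plot: look for the elbow where the curve flattens
plot(pve, type = "b",
     xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1))
```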

Unfortunately, there is no well-accepted objective way to decide how many principal components are enough. In fact, the question of how many principal components are enough is inherently ill-defined, and will depend on the specific area of application and the specific data set. In practice, we tend to look at the first few principal components in order to find interesting patterns in the data. If no interesting patterns are found in the first few principal components, then further principal components are unlikely to be of interest. Conversely, if the first few principal components are interesting, then we typically continue to look at subsequent principal components until no further interesting patterns are found. This is admittedly a subjective approach, and is reflective of the fact that PCA is generally used as a tool for exploratory data analysis.